Abstract: In this paper, we investigate the redundancy of 
universal coding schemes on smooth parametric sources in the 
finite-length regime. We derive an upper bound on the probability 
of the event that a sequence of length $n$, chosen using Jeffreys' 
prior from the family of parametric sources with $d$ unknown 
parameters, is compressed with a redundancy smaller than 
$(1-\epsilon)\frac{d}{2}\log n$ for any $\epsilon > 0$. Our results also confirm that 
for large enough n and d, the average minimax redundancy 
provides a good estimate for the redundancy of most sources. 
Our result may be used to evaluate the performance of universal 
source coding schemes on finite-length sequences. Additionally, 
we precisely characterize the minimax redundancy for two-stage 
codes. We demonstrate that the two-stage assumption incurs 
a negligible redundancy especially when the number of source 
parameters is large. Finally, we show that the redundancy is 
significant in the compression of small sequences. 

I. Introduction 

Recently, there has been a tremendous increase in the 
amount of data being stored in the storage systems. The re- 
dundancy present in the data may be leveraged to significantly 
reduce the cost of data maintenance as well as data transmis- 
sion. In many cases, however, the data consists of several small 
files that need to be compressed and retrieved individually, i.e., 
a finite-length compression problem. Moreover, different data 
sets may be of various natures, hence few a priori assumptions 
may be made regarding the probability distribution of the data, 
i.e., universal compression. This necessitates the study of the 
universal compression of finite-length sequences. 

In this paper, we investigate the universal compression of 
smooth parametric sources. Let $\mathcal{A}$ denote a finite alphabet. Let $C_n : \mathcal{A}^n \to \{0,1\}^*$ be an injective mapping from the set $\mathcal{A}^n$ of the sequences of length $n$ over $\mathcal{A}$ to the set $\{0,1\}^*$ of binary sequences. We use the notation $x^n = (x_1, \ldots, x_n) \in \mathcal{A}^n$ to denote a sequence of length $n$. Let $\theta = (\theta_1, \ldots, \theta_d)$ be a $d$-dimensional parameter vector. Denote $\mu_\theta$ as the parametric information source with $d$ unknown parameters, where $\mu_\theta$ defines a probability measure on any sequence $x^n \in \mathcal{A}^n$. Denote $\mathcal{P}^d$ as the family of sources with $d$-dimensional unknown parameter vector $\theta$. Let $H_n(\theta)$ be the source entropy given parameter vector $\theta$, i.e.,

$$H_n(\theta) = -\mathbf{E} \log \mu_\theta(X^n) = -\sum_{x^n \in \mathcal{A}^n} \mu_\theta(x^n) \log \mu_\theta(x^n). \tag{1}$$

Footnote 1: Throughout this paper, all expectations are taken with respect to the true unknown parameter vector $\theta$.



In this paper, $\log(x)$ always denotes the logarithm of $x$ in base 2. Let $l(C_n, x^n) = l_n(x^n)$ denote the length function that describes the codeword length associated with the sequence $x^n$. Denote $L_n$ as the set of all regular length functions on an input sequence of length $n$.

Denote $R_n(l_n, \theta)$ as the expected redundancy of the code on a sequence of length $n$, defined as the difference between the expected codeword length and the entropy,

$$R_n(l_n, \theta) = \mathbf{E}\, l_n(X^n) - H_n(\theta). \tag{2}$$

The expected redundancy is always non-negative. For a code that asymptotically achieves the entropy rate with length function $l_n$, $\frac{1}{n} R_n(l_n, \theta) \to 0$ as $n \to \infty$ for all $\theta$. The maximum expected redundancy of a code with length function $l_n$ is given as $R_n(l_n) = \max_{\theta \in \Theta^d} R_n(l_n, \theta)$, which may be minimized over all codes to achieve the minimax expected redundancy [1]-[3]

$$R_n = \min_{l_n \in L_n} \max_{\theta \in \Theta^d} R_n(l_n, \theta). \tag{3}$$
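As an illustration of definitions (2) and (3), the following Python sketch evaluates the expected redundancy of one particular universal code on a Bernoulli source by exact enumeration. The choice of the Krichevsky-Trofimov (KT) mixture as the code, the function names, and the grid search over $\theta$ are assumptions made for this example only; the integer constraint on codeword lengths is ignored.

```python
import math

def kt_log2_prob(k: int, n: int) -> float:
    """log2 of the KT-mixture probability of one binary sequence with k ones."""
    # q(x^n) = Gamma(k + 1/2) * Gamma(n - k + 1/2) / (pi * Gamma(n + 1))
    ln_q = (math.lgamma(k + 0.5) + math.lgamma(n - k + 0.5)
            - math.log(math.pi) - math.lgamma(n + 1))
    return ln_q / math.log(2)

def expected_redundancy(theta: float, n: int) -> float:
    """R_n(l_n, theta) = E l_n(X^n) - H_n(theta) in bits, with l_n = -log2 q."""
    total = 0.0
    for k in range(n + 1):
        # log2 mu_theta(x^n) for any sequence with k ones (0 < theta < 1)
        log2_mu = k * math.log2(theta) + (n - k) * math.log2(1 - theta)
        p_k = math.comb(n, k) * 2.0 ** log2_mu  # probability of observing k ones
        total += p_k * (log2_mu - kt_log2_prob(k, n))
    return total

def max_redundancy(n: int, grid: int = 99) -> float:
    """Grid approximation of the maximum over theta of R_n(l_n, theta)."""
    return max(expected_redundancy((i + 1) / (grid + 1), n) for i in range(grid))
```

Calling `expected_redundancy(0.3, 32)` gives the redundancy of this one code at a single parameter value, while `max_redundancy(32)` approximates the inner maximization in (3) for this code; the outer minimization over all codes is not attempted here.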

The leading term in the average minimax redundancy is 
asymptotically $\frac{d}{2} \log n$. Rissanen demonstrated that for the 
universal compression of the family of parametric sources $\mathcal{P}^d$ 
with parameter vector $\theta$, the redundancy of the codes with 
regular length functions $l_n$ is asymptotically lower bounded by 
$R_n(l_n, \theta) \geq (1-\epsilon)\frac{d}{2} \log n$ [4]-[6], for all $\epsilon > 0$ and almost all 
sources. This asymptotic lower bound is tight since there exist 
coding schemes that achieve the bound asymptotically [5], [7]. 
This result was later extended in [8]-[10] to more general 
classes of sources. In [11], we extended Rissanen's probabilistic 
treatment of redundancy to the universal compression 
of finite-length memoryless sources for the family of two-stage 
codes. However, the two-stage code assumption is restrictive 
and incurs an extra redundancy. 

In this paper, we extend our previous work to the family of parametric sources. We also relax the two-stage coding constraint by considering conditional two-stage codes so that the coding scheme is optimal in the sense that it achieves the minimax redundancy. Further, we derive the extra redundancy incurred by two-stage codes. The rest of this paper is organized as follows. In Section II, after a review of the previous work, we formally state the problem of redundancy for finite-length universal compression of parametric sources. In Section III, we present our main results on the compression of conditional two-stage and two-stage codes. In Section IV, we demonstrate the significance of our results through several examples.



II. Background Review and Problem Statement 

In this section, after a brief review of the previous work, we state the finite-length redundancy problem. Let $l_n^{\theta}$ denote the (non-universal) length function induced by a parameter $\theta \in \Theta^d$. Denote $l_n$ as the length function on the input sequence of length $n$. Denote $R_n(l_n, \theta)$ as the expected redundancy of the universal compression of source $\mu_\theta \in \mathcal{P}^d$ using the length function $l_n$. Let $I_n(\theta)$ be the Fisher information matrix for parameter vector $\theta$ and a sequence of length $n$,

$$I_n(\theta) = \{I_n^{ij}(\theta)\} = \frac{1}{n \log e}\, \mathbf{E}\left\{ \frac{\partial^2}{\partial\theta_i\, \partial\theta_j} \log\left(\frac{1}{\mu_\theta(X^n)}\right) \right\}. \tag{4}$$



The Fisher information matrix quantifies the amount of information, on the average, that each symbol in a sample sequence of length $n$ from the source conveys about the source parameters. In this paper, we assume that the following conditions hold:

1) $\Theta^d$ forms a compact set.

2) $\lim_{n \to \infty} I_n(\theta)$ exists and the limit is denoted by $I(\theta)$.

3) All elements of the Fisher information matrix $I_n(\theta)$ are continuous in $\Theta^d$.

4) $\int_{\Theta^d} |I(\theta)|^{1/2} d\theta < \infty$.

5) The family $\mathcal{P}^d$ has a minimal representation with the $d$-dimensional parameter vector $\theta$.
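To make the normalization in the Fisher information definition concrete, the sketch below evaluates it numerically for a Bernoulli source ($d = 1$). It assumes the definition reads $I_n(\theta) = \frac{1}{n \log e} \mathbf{E}\, \partial^2_\theta \log(1/\mu_\theta(X^n))$ with base-2 logarithms, and it replaces the derivative with a central finite difference; the function names and the step size `h` are choices made for this example. Under these assumptions the result should be close to the well-known value $1/(\theta(1-\theta))$ for every $n$, so the limit in condition 2 exists trivially.

```python
import math

def neg_log2_likelihood(theta: float, k: int, n: int) -> float:
    """log2(1 / mu_theta(x^n)) for a binary sequence with k ones out of n."""
    return -(k * math.log2(theta) + (n - k) * math.log2(1 - theta))

def fisher_information(theta: float, n: int, h: float = 1e-4) -> float:
    """I_n(theta) for a Bernoulli source via a central second difference."""
    total = 0.0
    for k in range(n + 1):
        p_k = math.comb(n, k) * theta**k * (1 - theta)**(n - k)
        second = (neg_log2_likelihood(theta + h, k, n)
                  - 2 * neg_log2_likelihood(theta, k, n)
                  + neg_log2_likelihood(theta - h, k, n)) / h**2
        total += p_k * second
    # dividing by n * log2(e) converts the per-sequence value in bits
    # into a per-symbol value in natural units
    return total / (n * math.log2(math.e))
```

For example, `fisher_information(0.3, 10)` and `fisher_information(0.3, 25)` should both be close to `1 / (0.3 * 0.7)`.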

Rissanen proved an asymptotic lower bound on the universal compression of information sources with $d$ parameters as [4], [6]:

Fact 1 For all parameters $\theta$, except in a set of asymptotically Lebesgue volume zero, we have

$$\lim_{n \to \infty} \frac{R_n(l_n, \theta)}{\frac{d}{2} \log n} \geq 1 - \epsilon, \quad \forall \epsilon > 0. \tag{5}$$



While Fact 1 describes the asymptotic fundamental limits of the universal compression of parametric sources, it does not provide much insight for the case of finite-length $n$. Moreover, the result excludes an asymptotically volume zero set of parameter vectors $\theta$ that has non-zero volume for any finite $n$.

Clarke and Barron derived the expected minimax redundancy $R_n$ for memoryless sources, later generalized in [12] by Atteson for Markov sources, as follows:

Fact 2 The average minimax redundancy is asymptotically given by

$$R_n = \frac{d}{2} \log\left(\frac{n}{2\pi}\right) + \log \int |I(\theta)|^{1/2} d\theta + O\left(\frac{1}{n}\right). \tag{6}$$
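The following Python sketch evaluates the leading terms of the average minimax redundancy numerically. It assumes the asymptotic form $R_n \approx \frac{d}{2} \log\left(\frac{n}{2\pi}\right) + \log \int |I(\theta)|^{1/2} d\theta$ with the remainder term dropped; this form is chosen because it reproduces the value $R_8 \approx 1.82$ bits quoted in Section IV-B for the Bernoulli source, for which the integral equals $\pi$. The function name and argument names are hypothetical.

```python
import math

def average_minimax_redundancy(n: int, d: int, jeffreys_integral: float) -> float:
    """Leading terms of R_n in bits; the vanishing remainder is ignored."""
    return (0.5 * d * math.log2(n / (2 * math.pi))
            + math.log2(jeffreys_integral))

def leading_term_ratio(n: int, d: int, jeffreys_integral: float) -> float:
    """Ratio of R_n to (d/2) log2 n, which tends to 1 as n grows."""
    return average_minimax_redundancy(n, d, jeffreys_integral) / (0.5 * d * math.log2(n))
```

For the Bernoulli source, `average_minimax_redundancy(8, 1, math.pi)` is about 1.83 bits, whereas the leading term alone is $\frac{1}{2}\log 8 = 1.5$ bits, so the constant terms are not negligible at this length.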



The average minimax redundancy characterizes the maximum redundancy over the space $\Theta^d$ of the parameter vectors. However, it does not say much about the rest of the space of the parameter vectors. It is known that if $\mu_\theta(x^n)$ is a measurable function of $\theta$ for all $x^n$, the average minimax redundancy is equal to the capacity of the channel between the parameter vector $\theta$ and the sequence $x^n$, i.e., $R_n = \sup_{p} I(\theta; X^n)$, where $p(\cdot)$ is a probability measure on the space of the parameter vector $\theta$ [1], [13]. The average minimax redundancy is obtained when the parameter vector $\theta$ follows the capacity achieving prior, which is Jeffreys' prior in the case of parametric sources. Jeffreys' prior is given by [2]

$$p(\theta) = \frac{|I(\theta)|^{1/2}}{\int |I(\lambda)|^{1/2} d\lambda}. \tag{7}$$
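For memoryless sources the normalizing integral in Jeffreys' prior has a closed form, which the sketch below evaluates. It relies on the standard fact that for a $k$-ary memoryless source $|I(\theta)| = 1/(\theta_1 \cdots \theta_k)$ in natural units, so that Jeffreys' prior is the Dirichlet$(1/2, \ldots, 1/2)$ distribution; the function names are hypothetical and the Markov case is not covered.

```python
import math

def jeffreys_normalizer(k: int) -> float:
    """Integral of |I(theta)|^(1/2) over the simplex for a k-ary memoryless source."""
    # Dirichlet(1/2, ..., 1/2) normalizer: Gamma(1/2)^k / Gamma(k/2)
    return math.exp(k * math.lgamma(0.5) - math.lgamma(k / 2))

def jeffreys_density_bernoulli(theta: float) -> float:
    """Jeffreys' prior density p(theta) for the binary (k = 2, d = 1) case."""
    return 1.0 / (math.sqrt(theta * (1 - theta)) * jeffreys_normalizer(2))
```

The normalizer equals $\pi$ for the binary source and $2\pi$ for the ternary source, which are the values needed to evaluate the bounds for the memoryless examples of Section IV.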



In a two-stage code, to encode the sequence $x^n$ the compression scheme attributes $m$ bits to identify an estimate for the unknown source parameters. Then, in the second stage of the compression, it is assumed that the source with the estimated parameter has generated the sequence. In this case, there will be $2^m$ possible estimate points in the parameter space for the identification of the source. Let $\Phi^m = \{\phi_1, \ldots, \phi_{2^m}\}$ denote the set of all estimate points with an $m$-bit estimation budget. Note that for all $i$, we have $\phi_i \in \Theta^d$ [14], [15].

Denote $l_n^{2p}$ as the two-stage length function for the compression of sequences of length $n$. For each sequence $x^n$, there exists an estimate point in the set of the estimate points, i.e., $\gamma = \gamma(x^n, m) \in \Phi^m$, which is optimal in the sense that it minimizes the code length and the average redundancy. In other words, $\gamma$ is the maximum likelihood estimate of the unknown parameter in the set of the estimate points, i.e.,

$$\gamma = \arg\min_{\phi_i \in \Phi^m} \log\left(\frac{1}{\mu_{\phi_i}(x^n)}\right) = \arg\max_{\phi_i \in \Phi^m} \mu_{\phi_i}(x^n). \tag{8}$$

The two-stage universal length function for the sequence $x^n$ is then given by

$$l_n^{2p}(x^n) = m + l_n^{\gamma}(x^n), \tag{9}$$

where $l_n^{\gamma}$ denotes the length function induced by the parameter $\gamma \in \Phi^m$. Let $L_n^{2p}$ be the set of all two-stage codes that could be described as in (9). Further denote $\mu_\gamma(x^n)$ as the probability measure induced by $\gamma$.
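A minimal Python sketch of the two-stage length function for a binary memoryless source follows. The estimate points are assumed to be the $2^m$ midpoints of equal-width cells of $(0, 1)$; this grid, like the function names, is a hypothetical choice for illustration and is not claimed to be the optimal placement. The integer constraint on the second-stage length is ignored.

```python
import math

def estimate_points(m: int) -> list:
    """Hypothetical grid: the 2^m midpoints of equal-width cells of (0, 1)."""
    return [(i + 0.5) / 2**m for i in range(2**m)]

def log2_likelihood(phi: float, k: int, n: int) -> float:
    """log2 mu_phi(x^n) for a binary sequence with k ones out of n."""
    return k * math.log2(phi) + (n - k) * math.log2(1 - phi)

def two_stage_length(bits: list, m: int) -> tuple:
    """Return (gamma, l_2p): the ML estimate point and m + log2(1 / mu_gamma(x^n))."""
    n, k = len(bits), sum(bits)
    # first stage: pick the estimate point that maximizes the likelihood
    gamma = max(estimate_points(m), key=lambda phi: log2_likelihood(phi, k, n))
    # second stage: ideal codeword length under mu_gamma, plus the m index bits
    return gamma, m - log2_likelihood(gamma, k, n)
```

For instance, `two_stage_length([1, 0, 0, 1, 0, 0, 0, 0], 2)` selects the estimate point 0.375, the grid point with the highest likelihood for a sequence with two ones out of eight.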

Increasing the bit budget $m$ for the identification of the unknown source parameters results in an exponential growth in the number of estimate points, and hence, smaller $l_n^{\gamma}(x^n)$ on the average due to the more accurate estimation of the unknown source parameter vector. On the other hand, $m$ directly appears as part of the compression overhead in (9). Therefore, it is desirable to find the optimal $m$ that minimizes the total expected codeword length, which is $\mathbf{E}\, l_n^{2p}(X^n) = m + \mathbf{E}\, l_n^{\gamma}(X^n)$.
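The trade-off in the bit budget $m$ can be explored numerically. The sketch below computes the expected two-stage redundancy $m + \mathbf{E} \log(1/\mu_\gamma(X^n)) - H_n(\theta)$ of a Bernoulli source by enumerating the number of ones, and then searches for the best $m$. The midpoint grid of estimate points, the search range `max_m`, and all function names are assumptions of this example.

```python
import math

def binary_entropy(theta: float) -> float:
    """Entropy of a Bernoulli(theta) symbol in bits."""
    return -theta * math.log2(theta) - (1 - theta) * math.log2(1 - theta)

def expected_two_stage_redundancy(theta: float, n: int, m: int) -> float:
    """m + E log2(1 / mu_gamma(X^n)) - H_n(theta), ignoring integer rounding."""
    points = [(i + 0.5) / 2**m for i in range(2**m)]  # assumed estimate points
    expected_len = float(m)
    for k in range(n + 1):
        p_k = math.comb(n, k) * theta**k * (1 - theta)**(n - k)
        # log2-likelihood of the best estimate point for sequences with k ones
        best = max(k * math.log2(phi) + (n - k) * math.log2(1 - phi)
                   for phi in points)
        expected_len -= p_k * best
    return expected_len - n * binary_entropy(theta)

def best_budget(theta: float, n: int, max_m: int = 8) -> int:
    """Bit budget in 0..max_m with the smallest expected redundancy."""
    return min(range(max_m + 1),
               key=lambda m: expected_two_stage_redundancy(theta, n, m))
```

With $m = 0$ the single estimate point is 0.5 and the redundancy reduces to $n(1 - h(\theta))$, where $h$ is the binary entropy; larger budgets reduce the estimation loss but pay $m$ bits up front, so the minimizer returned by `best_budget` depends on both $n$ and $\theta$.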

In this paper, we ignore the redundancy due to the integer constraint on the length function. Thus, we use the Shannon code for each estimated parameter to bound the average redundancy of two-stage codes, which gives

$$R_n(l_n^{2p}, \theta) = m + \mathbf{E} \log\left(\frac{1}{\mu_\gamma(X^n)}\right) - H_n(\theta). \tag{10}$$

Further, let $R_n^{2p}$ denote the average minimax redundancy of the two-stage codes, i.e.,

$$R_n^{2p} = \min_{l_n^{2p} \in L_n^{2p}} \max_{\theta \in \Theta^d} R_n(l_n^{2p}, \theta). \tag{11}$$



In a two-stage code, we already have some knowledge about the sequence $x^n$ through the optimally estimated parameter $\gamma(x^n)$ (maximum likelihood estimation) that can be leveraged for encoding $x^n$ using the length function $l_n^{\gamma}(x^n)$. The two-stage length function in (9) defines an incomplete coding length, which does not achieve the equality in Kraft's inequality. Thus, it is not optimal in the sense that it does not achieve the optimal compression among all length functions. Further, it does not achieve the average minimax redundancy [11], [15]. Conditioned on $\gamma(x^n)$, the length of the codeword for $x^n$ may be further decreased [14].

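Let $S_m(\gamma)$ be the collection of all $x^n$ for which the optimally estimated parameter is $\gamma$, i.e.,

$$S_m(\gamma) = \left\{ x^n \in \mathcal{A}^n : \mu_\gamma(x^n) \geq \mu_{\phi_i}(x^n)\ \ \forall \phi_i \in \Phi^m \right\}. \tag{12}$$

Further, let $A_m(\gamma)$ denote the total probability measure of all sequences in the set $S_m(\gamma)$, i.e.,

$$A_m(\gamma) = \sum_{x^n \in S_m(\gamma)} \mu_\gamma(x^n). \tag{13}$$

Thus, the knowledge of $\gamma(x^n)$ in fact changes the probability distribution of the sequence. Denote $\mu_\gamma(x^n | x^n \in S_m(\gamma))$ as the conditional probability measure of $x^n$ given $\gamma$ is known to be such that $x^n \in S_m(\gamma)$, i.e., the probability distribution that is normalized to $A_m(\gamma)$. That is,

$$\mu_\gamma(x^n | x^n \in S_m(\gamma)) = \frac{\mu_\gamma(x^n)}{A_m(\gamma)}. \tag{14}$$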



Note that $\mu_\gamma(x^n | x^n \in S_m(\gamma)) \geq \mu_\gamma(x^n)$ due to the fact that $A_m(\gamma) \leq 1$. Let $l_n^{\gamma}(x^n | x^n \in S_m(\gamma))$ be the codeword length corresponding to the conditional probability distribution, which is decreased to $\log\left(\frac{A_m(\gamma)}{\mu_\gamma(x^n)}\right)$. Denote $l_n^{c2p}$ as the conditional two-stage length function for the compression of sequences of length $n$ using the normalized maximum likelihood, which is given by

$$l_n^{c2p}(x^n) = m + l_n^{\gamma}(x^n | x^n \in S_m(\gamma)). \tag{15}$$



Therefore, the average redundancy of the conditional two-stage scheme is given by

$$R_n(l_n^{c2p}, \theta) = m + \mathbf{E} \log\left(\frac{A_m(\gamma)}{\mu_\gamma(X^n)}\right) - H_n(\theta). \tag{16}$$
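The effect of conditioning on the estimate can be checked by brute force on short binary sequences. The sketch below builds the sets $S_m(\gamma)$, the masses $A_m(\gamma)$, and both length functions for a Bernoulli family, again under the hypothetical midpoint grid of estimate points, with ties in the maximum likelihood step broken in favor of the first grid point. It then compares the Kraft sums of the two codes.

```python
import math
from itertools import product

def conditional_two_stage(n: int, m: int) -> tuple:
    """Return (l_2p, l_c2p): dictionaries mapping each x^n to its code length."""
    points = [(i + 0.5) / 2**m for i in range(2**m)]  # assumed estimate points

    def mu(phi, x):
        k = sum(x)
        return phi**k * (1 - phi)**(n - k)

    # gamma(x^n): maximum likelihood point; this partitions A^n into the sets S_m
    gamma = {x: max(points, key=lambda phi: mu(phi, x))
             for x in product((0, 1), repeat=n)}
    # A_m(gamma): total mu_gamma-probability of the sequences mapped to gamma
    mass = {phi: 0.0 for phi in points}
    for x, g in gamma.items():
        mass[g] += mu(g, x)
    l_2p = {x: m - math.log2(mu(g, x)) for x, g in gamma.items()}
    l_c2p = {x: m - math.log2(mu(g, x) / mass[g]) for x, g in gamma.items()}
    return l_2p, l_c2p

def kraft_sum(lengths: dict) -> float:
    """Sum of 2^(-l(x^n)) over all sequences."""
    return sum(2.0 ** -l for l in lengths.values())
```

For `n = 8` and `m = 2` every estimate point is selected by at least one sequence, so the conditional code meets Kraft's inequality with equality while the plain two-stage code leaves part of the budget unused, which is the incompleteness discussed above.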



Denote $L_n^{c2p}$ as the set of the conditional two-stage codes that are described using (15). Let $R_n^{c2p}$ denote the average minimax redundancy of the conditional two-stage codes, i.e.,

$$R_n^{c2p} = \min_{l_n^{c2p} \in L_n^{c2p}} \max_{\theta \in \Theta^d} R_n(l_n^{c2p}, \theta). \tag{17}$$



Rissanen demonstrated that this conditional version of two-stage codes is in fact optimal in the sense that it achieves the average minimax redundancy [16]. In other words, $R_n^{c2p} = R_n$, where $R_n$ is the average minimax redundancy in (6).



III. Main Results on the Redundancy 

In this section, we present our main results on the compression of parametric sources. The proofs are omitted due to the lack of space. We derive a lower bound on the probability of the event that a parametric source $P$ is compressed with redundancy greater than the redundancy level $R_0$, i.e., $P[R_n(l_n, \theta) > R_0]$. This bound demonstrates the fundamental limits of the universal compression for finite-length $n$. The following is our main result:

Theorem 1 Assume that the parameter vector $\theta$ follows Jeffreys' prior in the universal compression of the family of parametric sources $\mathcal{P}^d$. Let $\epsilon$ be a real number. Then,

$$P\left[ \frac{R_n(l_n^{c2p}, \theta)}{\frac{d}{2} \log n} \geq 1 - \epsilon \right] \geq 1 - \frac{1}{\int |I(\theta)|^{1/2} d\theta} \left( \frac{2\pi}{n^{\epsilon}} \right)^{d/2}. \tag{18}$$

This theorem is derived for the conditional two-stage length functions. Note that Fact 1 is readily deduced from Theorem 1 by letting $n \to \infty$.
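The bound of Theorem 1 can be inverted to obtain, for a target fraction $P_0$ of sources, the redundancy level $R_0$ that the bound guarantees. The sketch below assumes the bound has the form $1 - \left(2\pi / n^{\epsilon}\right)^{d/2} / \int |I(\theta)|^{1/2} d\theta$; setting this equal to $P_0$ and solving for $R_0 = (1 - \epsilon)\frac{d}{2}\log n$ gives $R_0 = \frac{d}{2}\log\frac{n}{2\pi} + \log\left((1 - P_0)\int |I(\theta)|^{1/2} d\theta\right)$. The function name is hypothetical.

```python
import math

def redundancy_level(p0: float, n: int, d: int, jeffreys_integral: float) -> float:
    """R_0 in bits such that the bound guarantees P[R_n >= R_0] >= p0."""
    return (0.5 * d * math.log2(n / (2 * math.pi))
            + math.log2((1 - p0) * jeffreys_integral))

# Ternary memoryless source: d = 2, and the Jeffreys integral equals 2 * pi.
ternary_levels = {
    (n, p0): redundancy_level(p0, n, 2, 2 * math.pi)
    for n in (32, 128) for p0 in (0.4, 0.6)
}
```

Under these assumptions the entries of `ternary_levels` agree, up to rounding, with the redundancy levels quoted for the ternary memoryless source in Section IV-A.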

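Next, we characterize the redundancy of two-stage codes. Let $l_n^{2p}$ be the two-stage length function as defined in (9). Further, denote $R_n(l_n^{2p}, \theta)$ as the expected redundancy of the universal compression for the source $P \in \mathcal{P}^d$ with parameter vector $\theta$ using $l_n^{2p}$. The following theorem sets a lower bound on the redundancy of two-stage codes.

Theorem 2 Consider the universal compression of the family of parametric sources $\mathcal{P}^d$ with the parameter vector $\theta$ that follows Jeffreys' prior. Let $\epsilon$ be a real number. Then,

$$P\left[ \frac{R_n(l_n^{2p}, \theta)}{\frac{d}{2} \log n} \geq 1 - \epsilon \right] \geq 1 - \frac{C_d}{\int |I(\theta)|^{1/2} d\theta} \left( \frac{d}{e\, n^{\epsilon}} \right)^{d/2}, \tag{19}$$

where $C_d$ is the volume of the $d$-dimensional unit ball, which is given by

$$C_d = \frac{\pi^{d/2}}{\Gamma\left(\frac{d}{2} + 1\right)}. \tag{20}$$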



Further, we precisely characterize the extra redundancy due to the two-stage assumption on the code as follows:

Theorem 3 In the universal compression of the family of parametric sources $\mathcal{P}^d$, the average minimax redundancy of two-stage codes is obtained by

$$R_n^{2p} = R_n + g(d) + O\left(\frac{1}{n}\right). \tag{21}$$

Here, $R_n$ is the average minimax redundancy defined in (3) and $g(d)$ is the two-stage penalty term given by

$$g(d) = \log \Gamma\left(\frac{d}{2} + 1\right) - \frac{d}{2} \log\left(\frac{d}{2e}\right). \tag{22}$$
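The penalty term is easy to evaluate numerically. The sketch below assumes $g(d) = \log \Gamma\left(\frac{d}{2} + 1\right) - \frac{d}{2} \log\left(\frac{d}{2e}\right)$ with base-2 logarithms, and compares it with the large-$d$ approximation $\frac{1}{2}\log(\pi d)$ that follows from Stirling's formula; the function names are hypothetical.

```python
import math

def two_stage_penalty(d: int) -> float:
    """g(d) in bits, using lgamma to avoid overflow for large d."""
    log2_gamma = math.lgamma(d / 2 + 1) / math.log(2)
    return log2_gamma - 0.5 * d * math.log2(d / (2 * math.e))

def stirling_penalty(d: int) -> float:
    """Large-d approximation of g(d): (1/2) log2(pi * d)."""
    return 0.5 * math.log2(math.pi * d)
```

Under this form, `two_stage_penalty(1)` is about 1.05 bits and `two_stage_penalty(65280)` is about 8.8 bits, which are the penalties quoted for the Bernoulli source and for the byte-level first-order Markov source in Section IV.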
Fig. 1. Average redundancy of the conditional two-stage codes (c2p) and the average minimax redundancy (Minimax) as a function of the fraction of sources $P_0$ with $R_n(l_n^{c2p}, \theta) > R_0$. Memoryless source $\mathcal{M}_0$ with $k = 3$ and $d = 2$.



Fig. 2. Average redundancy of the conditional two-stage codes (c2p) and the average minimax redundancy (Minimax) as a function of the fraction of sources $P_0$ with $R_n(l_n^{c2p}, \theta) > R_0$. First-order Markov source $\mathcal{M}_1$ with $k = 2$ and $d = 2$. Curves are shown for $n = 12$, $50$, $202$, and $811$.

IV. Elaboration on the Results 

In this section, we elaborate on the significance of our results. In Section IV-A, we demonstrate that the average minimax redundancy underestimates the performance of source coding in the small to moderate length $n$ for sources with small $d$. In Section IV-B, we compare the performance of two-stage codes with conditional two-stage codes. We show that the penalty term of two-stage coding is negligible for sources with large $d$ as well as for the sequences of long $n$. In Section IV-C, we demonstrate that as the number of source parameters grows, the minimax redundancy well estimates the performance of the source coding.

A. Redundancy in Finite-Length Sequences with Small d 

In Figures 1 and 2, the $x$-axis denotes a fraction $P_0$ and the $y$-axis represents a redundancy level $R_0$. The solid curves demonstrate the derived lower bound on the average redundancy of the conditional two-stage codes $R_0$ as a function of the fraction $P_0$ of the sources with redundancy larger than $R_0$, i.e., $P[R_n(l_n^{c2p}, \theta) \geq R_0] \geq P_0$. In other words, the pair $(R_0, P_0)$ on the redundancy curve means that at least a fraction $P_0$ of the sources that are chosen from Jeffreys' prior have an expected redundancy that is greater than $R_0$. Note that the unknown parameter vector is chosen using Jeffreys' prior.

Fig. 3. Average redundancy of the two-stage codes (solid) vs. average redundancy of the conditional two-stage codes (dotted) as a function of the fraction of sources $P_0$. Memoryless source $\mathcal{M}_0$ with $k = 2$ and $d = 1$.

First, we consider a ternary memoryless information source denoted by $\mathcal{M}_0$. Let $k$ be the alphabet size, where $k = 3$. This source may be parameterized using two parameters, i.e., $d = 2$. In Fig. 1, our results are compared to the average minimax redundancy, i.e., $R_n$ from (6). Since the conditional two-stage codes achieve the minimax redundancy, $R_n$ is in fact the average minimax redundancy for the conditional two-stage codes ($R_n^{c2p}$) as well. The results are presented in bits. As shown in Fig. 1, at least 40% of ternary memoryless sequences of length $n = 32$ ($n = 128$) may not be compressed beyond a redundancy of 4.26 (6.26) bits. Also, at least 60% of ternary memoryless sequences of length $n = 32$ ($n = 128$) may not be compressed beyond a redundancy of 3.67 (5.68) bits. Note that as $n \to \infty$, the average redundancy approaches the average minimax redundancy for most sources.

Next, let $\mathcal{M}_1$ denote a binary first-order Markov source ($d = 2$). We present the finite-length compression results in Fig. 2 for different values of sequence length $n$. The values of $n$ are chosen such that they are almost $\log(3)$ times the values of $n$ for the ternary memoryless source in the first example. This choice has been made to equate the amount of information in the two sequences from $\mathcal{M}_0$ and $\mathcal{M}_1$, allowing a fair comparison.

Figure 2 shows that the average minimax redundancy of the conditional two-stage codes for the case of $n = 12$ is given as $R_{12} \approx 2.794$ bits. Comparing Fig. 1 with Fig. 2, we conclude that the average redundancy of universal compression for a binary first-order Markov source is very similar to that of the ternary memoryless source, suggesting that $d$ is the most important parameter in determining the redundancy of finite-length sources. This subtle difference becomes even more negligible as $n \to \infty$ since the dominating factor of redundancy for both cases approaches $\frac{d}{2} \log n$.

Fig. 4. Average redundancy of the conditional two-stage codes (c2p) and the average minimax redundancy (Minimax) as a function of the fraction of sources $P_0$ with $R_n(l_n^{c2p}, \theta) > R_0$. First-order Markov source with $k = 256$ and $d = 65280$. The sequence length $n$ is measured in bytes (B).

As demonstrated in Figs. 1 and 2, there is a significant gap 
between the known result by the average minimax redundancy 
and the finite-length results obtained in this paper when a high 
fraction $P_0$ of the sources is concerned. The bounds derived in 
this paper are tight, and hence, for many sources the average 
minimax redundancy overestimates the average redundancy 
in universal source coding of finite-length sequences where 
the number of the parameters is small. In other words, the 
compression performance of a high fraction of finite-length 
sources would be better than the estimate given by the average 
minimax redundancy. 

B. Two-Stage Codes Vs Conditional Two-Stage Codes 

We now compare the finite-length performance of the two-stage codes with the conditional two-stage codes on the class of binary memoryless sources $\mathcal{M}_0$ with $k = 2$ ($d = 1$). The results are presented in Figure 3. The solid line and the dotted line demonstrate the lower bound for the two-stage codes and the conditional two-stage codes, respectively. As can be seen, the gap between the achievable compression using two-stage codes and that of the conditional two-stage codes constitutes a significant fraction of the average redundancy for small $n$. For a Bernoulli source, the average minimax redundancy of the two-stage code is given in [11] as

$$R_n^{2p} \approx R_n + \frac{1}{2} \log\left(\frac{\pi e}{2}\right) \approx R_n + 1.048. \tag{23}$$

The average minimax redundancy of two-stage codes for the case of $n = 8$ is given as $R_8^{2p} \approx 2.86$ bits while that of the conditional two-stage codes is $R_8 \approx 1.82$ bits. Thus, the two-stage codes incur an extra compression overhead of more than 50% for $n = 8$.
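The numbers in this Bernoulli example can be reproduced approximately with a few lines of Python. The sketch assumes that $R_n \approx \frac{d}{2}\log\frac{n}{2\pi} + \log \int |I(\theta)|^{1/2} d\theta$ with the integral equal to $\pi$ for the Bernoulli source, and that the two-stage penalty for $d = 1$ is $\frac{1}{2}\log\frac{\pi e}{2}$; the remainder terms are ignored and the function names are hypothetical.

```python
import math

def bernoulli_c2p_redundancy(n: int) -> float:
    """Approximate R_n in bits for the Bernoulli source (d = 1)."""
    return 0.5 * math.log2(n / (2 * math.pi)) + math.log2(math.pi)

def bernoulli_two_stage_penalty() -> float:
    """Two-stage penalty for d = 1: (1/2) log2(pi * e / 2)."""
    return 0.5 * math.log2(math.pi * math.e / 2)

r_c2p = bernoulli_c2p_redundancy(8)
r_2p = r_c2p + bernoulli_two_stage_penalty()
relative_overhead = bernoulli_two_stage_penalty() / r_c2p
```

Small differences in the last digit relative to the quoted values are expected, since the remainder terms are dropped and the quoted figures are rounded.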

In Theorem 3, we derived the extra redundancy $g(d)$ incurred by the two-stage assumption. We further use Stirling's approximation for sources with a large number of parameters in order to show the asymptotic behavior of $g(d)$ as $d \to \infty$:

$$g(d) = \frac{1}{2} \log(\pi d) + o(1). \tag{24}$$

Note that $o(1)$ denotes a function of $d$ and not $n$ here. Finally, we must note that the main term of redundancy in $R_n$ is $\frac{d}{2} \log n$, which is linear in $d$, but the penalty term $g(d)$ is logarithmic in $d$. Hence, the effect of the two-stage assumption becomes negligible for the families of sources with larger $d$.

C. Redundancy in Finite-Length Sequences with Large d 

The results of this paper can be used to quantify the significance of redundancy in finite-length compression. We consider a first-order Markov source with alphabet size $k = 256$. We intentionally picked this alphabet size as it is a common practice to use the byte as a source symbol. This source may be represented using $d = 256 \times 255 = 65280$ parameters. In Figure 4, the achievable redundancy is demonstrated for four different values of $n$. Here, again, the redundancy is measured in bits. The curves are almost flat when $d$ and $n$ are very large, validating our results that the average minimax redundancy provides a good estimate on the achievable compression for most sources. The sequence length in this example is presented in bytes (B). We observe that for $n = 256$ kB, we have $R_n(l_n, \theta) \geq 100{,}000$ bits for most sources. Further, the extra redundancy due to the two-stage coding is $g(d) \approx 8.8$ bits, which is a negligible fraction of the redundancy of 100,000 bits. If the source has an entropy rate of 1 bit per source symbol (byte), the compression overhead is 38% and 1.7% for sequences of lengths 256 kB and 16 MB, respectively. Hence, we conclude that redundancy may be significant for the compression of small low-entropy sequences. On the other hand, redundancy is negligible for sequences of higher lengths.
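The overhead percentages can be checked with a short calculation. The sketch below takes the redundancy of about 100,000 bits at 256 kB from the text, assumes that 1 kB is 1024 bytes, and assumes that the redundancy at another length differs only through the leading term $\frac{d}{2}\log n$, so that the terms that do not depend on $n$ cancel. The function names are hypothetical.

```python
import math

def redundancy_at(n_new: int, n_ref: int, r_ref: float, d: int) -> float:
    """Scale a known redundancy using the (d/2) log2 n leading term only."""
    return r_ref + 0.5 * d * math.log2(n_new / n_ref)

def overhead(redundancy_bits: float, n: int, entropy_rate_bits: float = 1.0) -> float:
    """Redundancy as a fraction of the ideal compressed size n * H."""
    return redundancy_bits / (n * entropy_rate_bits)

d = 256 * 255                      # parameters of the byte-level Markov source
n_small, n_large = 256 * 1024, 16 * 1024**2
r_small = 100_000.0                # bits, value quoted for n = 256 kB
r_large = redundancy_at(n_large, n_small, r_small, d)
```

Here `overhead(r_small, n_small)` is about 0.38 and `overhead(r_large, n_large)` is about 0.018, in line with the percentages stated above: the redundancy grows only logarithmically in $n$ while the compressed size grows linearly.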